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Abstract 

It is shown that the nucleotide sequences in DNA molecules have cluster- 
scaling properties (discovered for the first time in turbulent processes: Sreeni- 
vasan and Bershadskii, 2006, J. Stat. Phys., 125, 1141-1153.). These prop- 
erties are relevant to both types of nucleotide pair-bases interactions: hydro- 
gen bonds and stacking interactions. It is shown that taking into account 
the cluster-scaling properties can help to improve heterogeneous models of 
the DNA dynamics. Two human genes: BRCA2 and NRXN1, have been 
considered as examples. 
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1. Introduction 

A DNA molecule carries information in the form of four chemical groups 
or nucleotide bases: adenine, cytosine, guanine, and thymine, represented by 
the letters A, C, G and T. The order of bases on a DNA strand is the DNA 
sequence. If we read along one of the two DNA- helix sides we get text like 
GATACA... In the double-stranded DNA, the two strands run in opposite 
directions and the bases pair up such that A always pairs with T and G 
always pairs with C. That is because these particular pairs fit exactly to form 
effective hydrogen bonds with each other. The A-T base-pair has 2 hydrogen 
bonds and the G-C base-pair has 3 hydrogen bonds. The G-C interaction is 
therefore stronger than A-T, and A-T rich regions of DNA are more prone 
to thermal fluctuations and to initiation sites (origin) at unwinding stage of 
DNA replication process. The bases are oriented perpendicular to the DNA- 
helix axis. Constant thermal fluctuations result in local twisting, stretching, 
bending, and unwinding of the double-strands. 
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Figure 1: The stacking energies for different stacked base pairs. 

In solution DNA assumes linear configuration because it is the one of 
minimum energy. The helix axis of DNA in vivo is usually strongly curved 
because the stretched length of the human genome, for instance, is about 1 
meter and this length needs to be "packaged" in order to fit in the nucleus 
of a cell (the diameter of the nucleus from a typical human somatic cell is 
about 5 x 1CT 6 meters). Therefore, the DNA has to be highly organized. This 
packaging of DNA deforms it physically, thereby increasing its energy (less 
stable than relaxed DNA, due to less than optimal base stacking). In this 
situation certain strain is relieved by supercoiling: helix bends and twists to 
achieve better base stacking orientation despite having too many bp/turn. 
The difference in A-T and G-C interactions can be used for optimizing the free 
energy. The base-pairs stacking energies (the main stabilizing factor in the 
DNA duplex, see for instance Yakovchuk et al., 2006) are highly dependent 
on the base sequence (Saenger 1984). These interactions come partly from 
the overlap of the n electrons of the bases and partly from hydrophobic 
interactions. Quantum chemistry calculations give rather different energies 
for different stacked base pairs: Fig. 1. Therefore, certain clustering of the 
base-pairs can be used by nature in order to minimize the excess energy that 
builds up when DNA molecules are deformed during the process of packaging. 

Moreover, the increase in stored (potential) energy within the molecule 
is then available to drive reactions such as the unwinding events that occur 
during DNA replication. Before replication of DNA can occur, the length 
of the DNA double helix about to be copied must be unwound and the 
two strands must be separated by breaking the hydrogen bonds that link 
the paired bases. The process of replication begins in the DNA molecules at 
thousands of sites called origins of replication. Because the location and time 
of initiation of origins is generally stochastic, the time to finish replication 
will also be a stochastic process. The random distribution of origin firing 
raises the random gap problem: a random distribution will lead to occasional 
large gaps that should take a long time to replicate. Despite this each cell 
in a population must complete the replication process in an accurate and 
timely manner (see for instance, Hyrien et al., 2003; Jun and Rhind, 2008). 
Different solutions to this problem have been suggested (see, for instance, 
Blow et al., 2001; Rhind 2006; Conti et al., 2007; Yang and Bechhoefer, 
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Figure 2: The standard deviation for 5n(r) vs r for T-dominated sub-sequence of gene 
BRCA2 (in log- log scales). The straight line (the best fit) indicates the scaling law Eq. 
(2). 



2008). If the spacing of origins is not completely random then any regularity 
in the spacing of origins will tend to suppress the large gaps (Blow et al., 
2001). For instance, origins within specific clusters could be preferred to 
fire (Mesner et al., 2003; Shechter and Gautier, 2005). Since a G-C base 
pair, with three hydrogen bonds, is expected to be harder to break than 
an A-T base pair with only two bonds, a clustering of these two kinds of 
the base-pares can be operational in order to solve the random origin firing 
problem. 

2. Cluster-scaling 

Thus, one can see that certain clustering of the base pairs can be one of 
nature's solutions for the packing and unwinding DNA problems. Because of 
many orders of space scales involved in these processes one can expect that 
the clustering will exhibit scale-invariant properties (see, for instance, Stanley 
et al., 1999; Bershadskii, 2001). A cluster-scaling for stochastic systems was 
recently suggested in a paper of Sreenivasan and Bershadskii (2006). The 
genome data can be readily checked on the cluster-scaling properties in a " 1 
or 0" mapping. In this presentation (Voss, 1992; Podobnik et al., 2007 and 
references therein) one should put A=l and C=G=T=0 in an original DNA 
sequences to obtain an A-dominated sub-sequences (one can obtain C or G, 
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or T-dominated sub-sequences in analogous way). Then, to study statistical 
clustering in sub-sequences {cij} (where Oj=l or and i = 1,2...) one should 
take running average: 

1 i=j+r 
i=j 

along the sub-sequences. For the 1 or mapping this running average will 
present a weight of the sub-sequences in interval [j,j + r]. Following to 
Sreenivasan and Bershadskii (2006) we are interested in scaling variation of 
the standard deviation of the running density fluctuations ((^(r) 2 ) 1 / 2 with 

(6 nj (r)Y 2 - (2) 

where (...) denotes average over the sub-sequences, Snj(r) = %(t) — (n(r)). 
The power law, Eq. (2), corresponds to a scale- invariant (scaling) behavior. 

The exponent a in Eq. (2) was called by Sreenivasan and Bershadskii 
(2006) as cluster-exponent. For white noise zeros (intersections of a white 
noise signal with time axis) it can be derived analytically that a = 1/2 
(see Sreenivasan and Bershadskii, 2006). This value can be considered as an 
upper limit (non-clustering case) for the cluster-exponent. If < a < 0.5 we 
have a cluster-scaling situation, and the cluster-scaling is stronger for smaller 
values of a (see for examples Sreenivasan and Bershadskii, 2006). 

In this note we will present, as an example, results obtained for hu- 
man genome. The results of computations for the genome sequences as- 
sociated with genes: BRCA2 and NRXN1, are shown in figures 2 and 3 
respectively (the full set of the genome sequences can be found in site: 
http: / /www.ncbi.nlm.nih.gov[ ) . Molecular location of gene BRCA2 on chro- 
mosome 13: base pairs 32,889,616 to 32,973,808). BRCA2 gene helps prevent 
cells from growing and dividing too rapidly or in an uncontrolled way. By 
helping repair DNA, BRCA2 plays a role in maintaining the stability of a 
cell's genetic information. Gene NRXN1 (neurexin 1) is among the largest 
known in human, molecular location on chromosome 2: base pairs 50,145,642 
to 51,259,673. NRXN1 gene represents a strong candidate for involvement 
in the etiology of nicotine dependence, and even subtle changes in NRXN1 
might contribute to susceptibility to autism. 

We show in Figs 2 and 3 results for the T-dominated sub-sequences, 
whereas the results for A, C, and G-dominated sub-sequences are similar to 
those shown in the Figs. 2 and 3 (for each gene respectively). Fig. 2 shows 
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Figure 3: The standard deviation for 5u(t) vs t for T-dominated sub-sequence of gene 
NRXN1 (in log- log scales). The straight line (the best fit) indicates the scaling law Eq. 
(2). 



(in the log-log scales) dependence of the standard deviation of the running 
density fluctuations ((^(t) 2 ) 1 / 2 on r for the T-dominated subsequence of 
gene BRCA2. The straight line is drawn in this figure to indicate the scaling 
(2). The slope of this straight line provides us with the cluster-exponent 
a = 0.30 ± 0.02. Figure 3 shows analogous result for gene NRXN1 with 
a = 0.35 ± 0.02. One can see that in both cases we have rather strong 
cluster-scaling with the cluster-exponent different for the different genes. 



3. Hydrogen bonds 

The most popular potential for modeling the hydrogen (H) bond within a 
base-pair in the DNA chains is the Morse potential (see, for instance, Peyrard 
and Bishop, 1989; Dauxois et al., 1993; Campa and Giansanti, 1998; Hennig 
and Archilla, 2004): 



Viiyi) = A 



cxp 



2 m 



(3) 



where A is the site-dependent dissociation energy of the ith pair, which can 
take two values A = D A ~ T and A = D C ~ G for the A-T and the C-G pairs 
in the ith site respectively (the A-T pair includes two H bonds, while the 
C-G pair includes three H bonds, see Introduction); a" 1 is a measure of the 
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Figure 4: The standard deviation for 5n(r) vs r for A& T (circles) dominated sub-sequence 
of gene NRXN1 (in log- log scales). The straight lines (the best fit) indicate the scaling 
law Eq. (2). 



potential well width; variable yi is a dynamical deviation of the H bonds from 
their equilibrium lengths at position %. The ratio D C ~ G / D A ~ T = 1.5 is often 
used for the model purposes (though recent quantum chemical calculations 
by Sponer et al. 2001, results in a ratio D C ~ G / D A ~ T = 2). 

Randomly distributed along the DNA chain bivalued H-bond coupling 
strengths D A ~ T and D G ~ C are usually used in the DNA dynamics models. 
This would be appropriate for an arbitrary base pair sequence. However, as 
it is follows from previous consideration, homogeneous random distribution 
is not realistic even for the most long genes like NRXN1 (see, for instance, 
Li et al., 1998). The dynamic heterogeneous properties of DNA molecule 
was considered, for instance, as a reason for the so-called multi-step melting 
(Cule and Hwa, 1997). However, the assumption of a random and short-range 
(delta-) correlated sequence made by Cule and Hwa (1997) do not result in 
the multi-step melting and only an additional assumption of an additional 
backbone stiffness due to the double-stranded conformation of DNA molecule 
allowed to the authors to observe a multi-step melting in their model. In the 
model suggested by Jeon et al. (2007) the sequence randomness considered 
as a quenched noise with finite sequence correlation length. In this approach 
regions dominated by A-T or, alternatively, by C-G pairs play significant role 
in the bubble (i.e. locally denaturated states) formation. 
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Figure 5: The standard deviation for <5n(r) vs r for AT-TA (circles) and CG-GC (filled 
circles) dominated sub-sequence of gene BRC A2 (in log- log scales) . The straight lines (the 
best fit) indicate the scaling law Eq. (2). 



Taking into account the cluster-scaling of the DNA nucleotides is a natural 
step toward more realistic dynamical model. Because of the bivalued H-bond 
coupling strengths: Di = D A ~ T or Di = D G ~° , this can be readily done 
using following bivalued mapping: A — T — 1, C = G = or, alternatively, 
C = G = 1, A = T = 0. Figure 4 shows cluster-scaling behavior, Eq. (2), 
for the former mapping of NRXN1 gene. The cluster scaling exponent a = 
0.32±0.02 in this case. For C — G = 1, A = T = the mapping calculations 
give the same result as well as for corresponding mappings of the gene BRCA2 
(indication of an universality). Therefore, the bivalued sequences of the Di 
coefficients for the DNA dynamic chain should be chosen as cluster-scaling 
ones with certain cluster-exponent a Eq. (2) (for the considered genes a ~ 
0.32). 

4. Stacking interaction 

Two factors are mainly responsible for the stability of the DNA double 
helix: base pairing between complementary strands and stacking between 
adjacent bases (see Introduction). It is shown experimentally that DNA 
stability is mainly determined by base-stacking interactions which contribute 
greatly into the dependence of the duplex stability on its sequence, (see, for 
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Figure 6: The standard deviation for <5n.(r) vs r for AT-TA (circles) and CG-GC (filled 
circles) dominated sub-sequence of gene NRXN1 (in log-log scales). The straight lines (the 
best fit) indicate the scaling law Eq. (2). 

instance, a recent paper Yakovchuk et al., 2006). Therefore, it is interesting 
to check whether the staking interactions dominate also the above- considered 
cluster-scaling phenomenon (cf. Introduction). In order to check this let us 
use following mapping: in combination AT = TA = 11, and A = T = G = 
C = otherwise. An alternative mapping is: in combination CG = GC = 
1 1, and A = T = G = C = otherwise. If the stacking interactions dominate 
the cluster-scaling phenomenon, then one can expect that the cluster-scaling 
will be more pronounced just for these maps (cf. Fig. 1). It means that 
the cluster-exponent corresponding to these maps would be smaller than 
cluster-exponents observed for the above considered maps. As one can see 
comparing Figs. 5 and 6 with Figs. 2,3, and 4 in reality we have an opposite 
situation. This comparison indicates that the stacking interactions do not 
dominate the above-considered cluster-scaling phenomenon (at least for the 
examples given in the paper). 

In a realistic dynamic model of DNA molecule one should take into ac- 
count also the cluster-scaling of staking interaction itself as it is shown in 
Figs. 5 and 6 for instance (see also Ambjornsson et al., 2006; Krueger et. al., 
2006, for heterogeneity of both pairing and stacking interactions). This can 
be done in the frames of a commonly used approximation for the stacking 
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potential (see, for instance, Joyeux and Florescu 2009) 

= (l - exp (-b( Vi - y^) 2 )) (4) 

where Aif; can take different values for different staked pairs Be- 
cause the situation is not bivalued in this case this task seems to be more 
difficult than for the hydrogen bonds. The main problem here is hybridiza- 
tion of the nucleotides in different types of the stacked base-pairs (Fig. f). 
The fact that the cluster-scaling exponents for the different types of stacked 
base-pairs have approximately the same value can help to solve this problem. 
This is not the case, however, for hybridization problem if one will consider 
a realistic model taking into account cluster-scaling of both hydrogen bonds 
and staking interactions (the cluster-scaling exponents are different for hy- 
drogen bonds: Fig. 4, and for staking interactions: Figs. 5 and 6). 
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